Discovering Minimal Infrequent Structures from XML Documents

نویسندگان

Wang Lian

Nikos Mamoulis

David Wai-Lok Cheung

Siu-Ming Yiu

چکیده

More and more data (documents) are wrapped in XML format. Mining these documents involves mining the corresponding XML structures. However, the semi-structured (tree structured) XML makes it somewhat difficult for traditional data mining algorithms to work properly. Recently, several new algorithms were proposed to mine XML documents. These algorithms mainly focus on mining frequent tree structures from XML documents. However, none of them was designed for mining infrequent structures which are also important in many applications, such as query processing and identification of exceptional cases. In this paper, we consider the problem of identifying infrequent tree structures from XML documents. Intuitively, if a tree structure is infrequent, all tree structures that contain this subtree is also infrequent. So, we propose to consider the minimal infrequent structure (MIS), which is an infrequent structure while all proper subtrees of it are frequent. We also derive a level-wise mining algorithm that makes use of the SG-tree (signature tree) and some effective pruning techniques to efficiently discover all MIS. We validate the efficiency and feasibility of our methods through experiments on both synthetic and real data.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Toward Semantic XML Clustering

The increasing availability of heterogeneous XML informative sources has raised a number of issues concerning how to represent and manage semistructured data. Although XML sources can exhibit proper structures and contents, differently annotated XML documents may in principle encode related semantics due to subjective definitions of markup tags. Discovering knowledge to infer semantic organizat...

متن کامل

Discovering Pattern-Based Dynamic Structures from Versions of Unordered XML Documents

Existing works on XML data mining deal with snapshot XML data only, while XML data is dynamic in real applications. In this paper, we discover knowledge from XML data by taking account its dynamic nature. We present a novel approach to extract patternbased dynamic structures from versions of unordered XML documents. With the proposed dynamic metrics, the pattern-based dynamic structures are exp...

متن کامل

خوشه‌بندی فراابتکاری اسناد فارسی اِکس‌اِم‌اِل مبتنی بر شباهت ساختاری و محتوایی

Due to the increasing number of documents, XML, effectively organize these documents in order to retrieve useful information from them is essential. A possible solution is performed on the clustering of XML documents in order to discover knowledge. Clustering XML documents is a key issue of how to measure the similarity between XML documents. Conventional clustering of text documents using a do...

متن کامل

XML structural delta mining: Issues and challenges

Recently, there is an increasing research efforts in XML data mining. These research efforts largely assumed that XML documents are static. However, in reality, the documents are rarely static. In this paper, we propose a novel research problem called XML structural delta mining. The objective of XML structural delta mining is to discover knowledge by analyzing structural evolution pattern (als...

متن کامل

Discovering Restricted Regular Expressions with Interleaving

Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that canno...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2004

Discovering Minimal Infrequent Structures from XML Documents

نویسندگان

چکیده

منابع مشابه

Toward Semantic XML Clustering

Discovering Pattern-Based Dynamic Structures from Versions of Unordered XML Documents

خوشه‌بندی فراابتکاری اسناد فارسی اِکس‌اِم‌اِل مبتنی بر شباهت ساختاری و محتوایی

XML structural delta mining: Issues and challenges

Discovering Restricted Regular Expressions with Interleaving

عنوان ژورنال:

اشتراک گذاری